Install the necessary libaries including those used for A1 and A2

if (!requireNamespace("BiocManager", quietly = TRUE))    
  install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))    
  BiocManager::install("GEOmetadb")
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
if (!requireNamespace("edgeR", quietly = TRUE))
  BiocManager::install("edgeR")
if (!requireNamespace("ggplot2", quietly = TRUE))
  install.packages("ggplot2")
if (!requireNamespace("reshape2", quietly = TRUE))
  BiocManager::install("reshape2")
if (!requireNamespace("HGNChelper", quietly = TRUE))
  BiocManager::install("HGNChelper")
if (!requireNamespace("limma", quietly = TRUE))
  BiocManager::install("limma")
if (!requireNamespace("knitr", quietly = TRUE))
  BiocManager::install("knitr")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
  install.packages("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
  install.packages("circlize")
if (!requireNamespace("Biobase", quietly = TRUE))
  BiocManager::install("Biobase")

A1 Summary

In A1, I chose my GEO data set based on an experiment involving human induced pluripotent stem cells of family with autism spectrum disorder (ASD). The study aims to see which genes are upregulated or downregulated in cells of patients with varying degrees of ASD vs control. I worked on normalizing the data and removing outliers for A1.

myGEOID <- 'GSE129806'
suppFiles = GEOquery::getGEOSuppFiles(myGEOID)
fileNames = rownames(suppFiles)
data <- read.delim(fileNames[2], check.names=FALSE, header=TRUE)
colnames(data)[1] <- 'gene_id'
# Remove dates from excel errors
data <- data[!grepl('^[0-9]{1,2}[-][a-zA-Z]{3}$', data$gene_id),]
rownames(data) <- data$gene_id
replacements <- HGNChelper::checkGeneSymbols(data$gene_id)
## Maps last updated on: Thu Oct 24 12:31:05 2019
## Warning in HGNChelper::checkGeneSymbols(data$gene_id): Human gene symbols should
## be all upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in HGNChelper::checkGeneSymbols(data$gene_id): x contains non-approved
## gene symbols
# Update the gene id with the newest HGNC symbols if available
data$gene_id <- ifelse(!replacements$Approved & !is.na(replacements$Suggested.Symbol), replacements$Suggested.Symbol, replacements$x)
# Calculate CPM
cpms = edgeR::cpm(data[,3:34])
rownames(cpms) <- data[,1]
# Remove low counts
keep = rowSums(cpms > 1) >= 3
data_filtered = data[keep,]
rows_removed = dim(data)[1]-dim(data_filtered)[1]
# Based on code from lecture 4
samples <- data.frame(lapply(colnames(data)[3:34], FUN=function(x){unlist(strsplit(x, split="_"))[c(1,2)]}))
colnames(samples) = colnames(data)[3:34]
rownames(samples) = c("patients", "cell_type")
samples <- data.frame(t(samples))

# Based on code from lecture 4
data_filtered_matrix <- as.matrix(data_filtered[,3:34])
rownames(data_filtered_matrix) <- data_filtered$gene_id
d = edgeR::DGEList(counts=data_filtered_matrix, group=samples$cell_type)
d = edgeR::calcNormFactors(d)
normalized_counts <- edgeR::cpm(d)

A2 Summary

In A2, I calcualted differential expression of the cleaned data set. I examined up-regulated genes and down-regulated genes, and used g:Profiler to find overlapping gene sets to try and understand the data.

Code summary of A2.

model_design <- model.matrix(~ samples$patients)

# Based off lecture notes

# There was an error citing duplicate row names. We will use make.names to make them unique.
rownames(normalized_counts) = make.names(rownames(normalized_counts), unique=TRUE)

# Calculate expression matrix and minimal set
expression_matrix <- as.matrix(normalized_counts)
minimal_set <- Biobase::ExpressionSet(assayData=expression_matrix)

# Fit the data
fit <- limma::lmFit(minimal_set, model_design)
fit2 <- limma::eBayes(fit,trend=TRUE)

topfit <- limma::topTable(fit2,
                          coef=ncol(model_design),
                          adjust.method = "BH",
                          number = nrow(expression_matrix))
output_hits <- merge(rownames(normalized_counts),
                     topfit,
                     by.y=0, by.x=1,
                     all.y=TRUE)
output_hits <- output_hits[order(output_hits$P.Value),]

The g:Profiler results are shown below

Figure_A2_1

Figure_A2_2

A3

I created a rank file from the output hits with gene symbol and rank score.

output_hits_to_write <- output_hits

# Use signed fold change * -log10pvalue as the rank metric
output_hits_to_write$rank <- sign(output_hits_to_write$logFC) * -log10(output_hits_to_write$P.Value)
all_genes <- output_hits_to_write[order(output_hits_to_write$rank, decreasing = TRUE),]

write.table(x=data.frame(gene_id=all_genes$x, rank=all_genes$rank), file="all_genes.rnk", sep= "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

Non-threshold Gene Set Enrichment Analysis

  1. I will be using the javaGSEA version 4.1.0 desktop application to do my gene set enrichment analysis because of its simplicity and intuitive GUI (Subramanian et al. 2005). I downloaded the jar application and updated my Java runtime to get the app ready.

Next, I opened the app and selected the tool: Run GSEAPreranked. For the gene sets database I set it to the current Baderlab GeneSet Collection (March 01, 2021) containing GOBP data excluding GO annotations with evidence codes IEA (inferred from electronic annotation) (Merico et al. 2010).

For the GSEA process, I ran it with 1000 permutations and set max size to 200 and min size to 15.

  1. The enrichment results I got from the process were: #### Down Regulated 3523 / 5383 gene sets are upregulated in phenotype na_pos 95 gene sets are significant at FDR < 25% 161 gene sets are significantly enriched at nominal pvalue < 1% 395 gene sets are significantly enriched at nominal pvalue < 5%

Up Regulated

1860 / 5383 gene sets are upregulated in phenotype na_neg 0 gene sets are significantly enriched at FDR < 25% 23 gene sets are significantly enriched at nominal pvalue < 1% 76 gene sets are significantly enriched at nominal pvalue < 5%

Summarized results from A2’s Over Representation Analysis

Up regulated: 0 GOBP, 15 Reactome, 2 Wikipathways Down regulated: 163 GOBP, 34 Reactome, 9 Wikipathways

  1. Examining the results, I can see that similar to the Over Representation Analysis, the GSEA process returned significantly more gene sets for down regulated genes than up regulated genes.

Visualize Gene set Enrichment Analysis using Cytosacape

  1. I downloaded Cytoscape 3.8.2 from the internet first. After entering the app, I proceeded to install the EnrichmentMap 3.3.1 plugin. Using the EnrichmentMap plugin, I built a heat map network using the default parameters. P-value was set to 1.0. FDR q-value cutoff for nodes was set to 0.1. Jaccard Overlap Combined was set to 0.375 (k constant = 0.5). There were 63 edges and 23 nodes in the heatmap.

Figure_1 2. Next, I annotated the enrichment map using the AutoAnnotate program. The paramters used for this were: Cluster Source: clusterMaker2, ClusterMaker Algorithm: MCL Cluster, Edge Attribute: EnrichmentMap::similarity_coefficient, Label Maker: WorldCloud: Adjacent Words, Max Words Per Label: 3, Word Adjacency Bonus: 8, Normalization Factor: 0.5, Attribute Names: [EnrichmentMap:GS_DESCR], Display Style: Clustered-Standard, Max Words per Cloud: 250, Cluster Cutoff: 1.0, Min Word Occurrence: 1.

Figure_2

Publication Ready Enrichment Map

  1. Figure_3

Collapsed Theme Network

  1. Figure_4

Looking at the theme network, I can see that there are 4 major themes of significance. The themes match the themes discussed in the paper. In the paper, they found out that in autism spectrum disorder patients, exoressions of genes related to cell growth and proliferation are affected. Similarly, in the theme network, I see that microtubule organizing center, signal spindle formation and g1 mitotic cell are all related to cell growth and proliferation.

Interpration and Detailed View of Results

  1. The enrichment results support the results found in the analysis from assignment 2. In assignment 2, I found that significantly downregulated genes were over represented in genesets involving mitotic cell cycles, microtubule organizing center and amino acid activiation, which is similar to the themes found in the GSEA anlaysis. Different from assignment 2 is that I found specifically signal spindle formation here, but didn’t find any genesets mentioning spindle formation in assignment 2.

  2. To support the microtubule organizing center theme, I found a paper on how microtubule-associated proteins are affected in autistic patients (Pramparo et. al. 2015). It shows that these proteins are often downregulated in cells of patients with autism vs cells of control. In another paper, it’s also been found that children with autism had lower serine levels than control, which relates to the serine family amino theme (ElBaz et al. 2014). Finally, I found a paper that looks into the link between cell cycle and autism syndrome disorder supporting the signal spindle formation and mitotic cell themes since both are important in cell division (Pramparo et al. 2015)

Figure_5

  1. I chose to do a post analysis of the main network using microRNAs. I chose to use microRNAs because I have read studies on how miRNA could be associated with and involved in autism. Furthermore, many new research studies are targeting miRNA as a predictive tool in studying autism (Schepici et al. 2019). To do the analysis on the EnrichmentMap plugin I clicked into add signature gene sets. Then I chose load from web, and loaded the Human_miRs_MSigdb_March_01_2021_symbol.gmt file from the Baderlab database. Specifically I selected, MIR548 family of miRNA genesets from the signature gene sets section and selected the hypergeometric test with 0.05 cutoff. After I switched to my network and saw that the MIR548 family genesets were linked to the g1 mitotic cell theme and specifically the negative reguation of G1/S transition of mitotic cell cycle, signal transduction by p53 class mediator and HallMark_E2F_Targets genesets. In a study on miRNA expression in autism, it was found that miRNA548 was dsyregulated in ASD individuals and affect several genes invovled in neurodevelopment (Hicks et al. 2016). The genesets shown to have correlation with miRNA548 sets are also those that have functions in neurodevelopment.

References

References added for A3